Introduction (20 points)

Hypothesis: If HIV-2 binds less efficiently to CD4 cell receptors than HIV-1 does, then HIV-1 has a better protein-protein interaction with the CD4 cell surface than HIV-2 has.

Question: Why does HIV-1 bind more efficiently to the CD4 cell surface than HIV-2?

Background: There are many strains of HIV, grouped into two main types: HIV-1 and HIV-2. HIV-1 is the type that the majority of people with HIV have, and it is known to be more transmissible and to cause a quicker decline in health. Fewer people live with HIV-2, which is less transmissible and causes a slower decline in health. So why is HIV-1 more transmissible, with worse symptoms, than HIV-2? One possibility is a difference in how the two viruses bind to the receptors on CD4 T cells. CD4 T cells are helper white blood cells that look for foreign bodies that can make people sick. HIV infects these CD4 T cells by binding and fusing to the cell using two proteins on the surface of the virus, gp120 and gp41. The gp120 proteins of HIV-1 and HIV-2 are similar in structure even though their sequences look very different (Davenport). I therefore investigate the gp41 protein in this project, because a structural difference in this protein could be the cause of the differences in symptoms and transmissibility between the two types of HIV. The FASTA files were sourced from the NCBI website and the PDB files were sourced from the RCSB website.

Data sources: .pdb files from UniProt and sequences from the NCBI website; these are also listed in the README of the GitHub repository. (To do: add more.)

Loading in Packages (15 points)

In the cell below, each import statement is preceded by a comment describing the package.

Above each import there is also a link to a website that explains how to install the package from the command line.
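Before running the imports, it can help to check which packages are actually installed. The snippet below is a minimal sketch, not the notebook's own import cell. The list of import names is an assumption based on the packages this report mentions (Biopython is imported under the name Bio), and find_missing is a hypothetical helper.

```python
import importlib.util

# Import names assumed from the packages this report mentions.
# Biopython is installed as "biopython" but imported as "Bio".
REQUIRED = ["Bio", "numpy", "matplotlib", "nglview"]


def find_missing(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found.")
```

Any name printed as missing needs to be installed from the command line before the import cell will run.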

Performing Bioinformatics Analysis (20 points)

The code in the 2 cells below is from: https://stackoverflow.com/questions/58673944/get-all-bounds-lengths-from-pdb. This code measures the bond lengths in the PDB files, that is, the distances between bonded atoms. This is a type of protein measurement that is useful for comparing atomic distances between structures. It can tell us where differences lie that may account for HIV-1 binding to the CD4 cell surface more efficiently than HIV-2.
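To illustrate the idea behind the measurement, here is a minimal sketch written for this explanation; it is not the Stack Overflow code. It reads the x, y, z coordinates from fixed-width ATOM records and reports the distance between every pair of atoms closer than a cutoff. The 2.0 angstrom cutoff, the toy coordinates, and the helper names (atom_line, parse_atoms, bond_lengths) are all assumptions for the example, not values from the real gp41 files.

```python
import math
from itertools import combinations


def atom_line(serial, name, x, y, z):
    """Build one fixed-width PDB ATOM record (toy data for this sketch only)."""
    return (f"ATOM  {serial:5d} {name:^4s} ALA A   1    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(lines):
    """Extract (atom name, (x, y, z)) from ATOM/HETATM records."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            name = line[12:16].strip()
            # Coordinates occupy columns 31-38, 39-46 and 47-54 of the record.
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((name, xyz))
    return atoms


def bond_lengths(atoms, cutoff=2.0):
    """Return (name1, name2, distance) for atom pairs within the assumed cutoff."""
    bonds = []
    for (name1, pos1), (name2, pos2) in combinations(atoms, 2):
        distance = math.dist(pos1, pos2)
        if distance <= cutoff:
            bonds.append((name1, name2, round(distance, 3)))
    return bonds


# Three made-up atoms, not taken from any real structure.
toy = [
    atom_line(1, "N", 0.0, 0.0, 0.0),
    atom_line(2, "CA", 1.5, 0.0, 0.0),
    atom_line(3, "C", 1.5, 1.5, 0.0),
]
for first, second, length in bond_lengths(parse_atoms(toy)):
    print(f"{first}-{second}: {length:.3f}")
```

Treating every close pair as a bond is a simplification; it is enough to compare distance patterns between two structures but does not check real chemical bonding.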

Pairwise sequence alignment between the sequences of HIV-1 and HIV-2. This will show where the sequences differ in ways that could account for the difference in how efficiently each type binds to CD4 T cells. I was originally going to focus on the env gene, especially the gp120 protein, but after reading one of my papers I decided to focus on the gp41 protein instead. One of the papers suggested that the gp120 protein looks structurally similar between the two types of HIV, so I will focus on the protein involved in fusion. gp41 is also encoded by the env gene and is also important for the virus to attach to and enter the CD4 T cell. We will use a dot plot to visualize the comparison between the two sequences.

The input data for this method is FASTA files read by AlignIO.
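To show what a pairwise alignment computes, here is a minimal pure-Python sketch of global (Needleman-Wunsch) alignment. It is a stand-in for illustration, not the Biopython code used in the notebook. The scoring values (match +1, mismatch -1, gap -1), the function names, and the toy sequences are assumptions; they are not real gp41 data.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


def percent_identity(aligned_a, aligned_b):
    """Share of alignment columns where both sequences have the same letter."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# Toy sequences, made up for the example.
top, bottom, total = global_align("ACGT", "ACT")
print(top)
print(bottom)
print("score:", total, "identity:", percent_identity(top, bottom))
```

A low percent identity between two sequences is the numeric counterpart of a sparse diagonal in a dot plot.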

Plotting The Results (15 points)

A dot plot is used to analyze the data. A dot plot visualizes the similarities between two sequences, either protein sequences or DNA/RNA sequences. In this case we compare the HIV-1 and HIV-2 gp41 sequences to see where they are similar and where they differ.
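The logic behind a dot plot can be sketched in a few lines. The version below is an illustration in plain Python so it runs without NumPy or matplotlib; the notebook's own plot is drawn differently. The function names, the window parameter, and the toy sequences are assumptions for the example.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Return a grid where cell [i][j] is 1 if the windows starting at
    seq_a[i] and seq_b[j] are identical, and 0 otherwise."""
    n_rows = len(seq_a) - window + 1
    n_cols = len(seq_b) - window + 1
    return [
        [1 if seq_a[i:i + window] == seq_b[j:j + window] else 0
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


def render(matrix):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


# Toy sequences, made up for the example (not real gp41 data).
toy_a = "ACGTAC"
toy_b = "ACGAAC"
print(render(dot_matrix(toy_a, toy_b, window=2)))
```

Similar sequences produce a long unbroken diagonal of matches; gaps or shifts in the diagonal mark regions where the sequences differ. A larger window removes scattered chance matches.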

Homology modeling of HIV-1 and HIV-2. This will predict the 3D structure of the protein we are focusing on. Protein structure tends to be conserved, so this will show whether the two types differ only at the sequence level or whether they have evolved proteins that also look structurally different. Again, I will focus on the gp41 protein and on how similar or different the two structures appear.

The input data for this method is PDB files from RCSB. I will use the nglview package to render the proteins for visualization.

Seen below.

Comments appear above the lines of code they describe in the cells below.

Code Formatting Requirements (15 points)

A local variable is only used within a function. It is created in the indented block under a "def" statement, and it cannot be used outside that function. Global variables are created outside of any function and can be used anywhere in the code. Each variable in the code above is described as either local or global.

In my code above you can see that I used Biopython's SeqIO. I also used NumPy and matplotlib, which were discussed in class.
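The difference can be shown with a small example. The names below (virus_types, count_types, n_types, local_is_visible) are hypothetical and chosen only for this illustration; they are not variables from the project code.

```python
# Global variable: defined outside any function, usable anywhere below.
virus_types = ["HIV-1", "HIV-2"]


def count_types():
    # Local variable: exists only while count_types() is running.
    n_types = len(virus_types)  # reading a global inside a function is allowed
    return n_types


print(count_types())

# Using the local name outside the function raises NameError.
try:
    n_types
    local_is_visible = True
except NameError:
    local_is_visible = False

print("n_types visible outside the function:", local_is_visible)
```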

Analyzing the Results (15 points)

From the dot plot we can see that the two sequences have few stretches of nucleotides in common. This reveals differences at the sequence (DNA) level. Given these large differences at the DNA level, I infer that the proteins may differ structurally as well. The protein visualization is consistent with this: HIV-2 has a tail-like structure that HIV-1 lacks. This could be a reason why HIV-1 binds better to the surface of CD4 T cells. More research into what that structure does would help determine how it affects binding to the cell.